64 research outputs found

    Thermodynamic simulation of deoxyoligonucleotide hybridization, polymerization, and ligation

    Thesis (M.S.)--Massachusetts Institute of Technology, Dept. of Electrical Engineering and Computer Science, 1997. Includes bibliographical references (leaves 54-55). By Alexander J. Hartemink. M.S

    Learning a Hybrid Architecture for Sequence Regression and Annotation

    When learning a hidden Markov model (HMM), sequential observations can often be complemented by real-valued summary response variables generated from the path of hidden states. Such settings arise in numerous domains, including many applications in biology, like motif discovery and genome annotation. In this paper, we present a flexible framework for jointly modeling both latent sequence features and the functional mapping that relates the summary response variables to the hidden state sequence. The algorithm is compatible with a rich set of mapping functions. Results show that the availability of additional continuous response variables can simultaneously improve the annotation of the sequential observations and yield good prediction performance in both synthetic data and real-world datasets. Comment: AAAI 201

    Finding regulatory DNA motifs using alignment-free evolutionary conservation information

    As an increasing number of eukaryotic genomes are being sequenced, comparative studies aimed at detecting regulatory elements in intergenic sequences are becoming more prevalent. Most comparative methods for transcription factor (TF) binding site discovery make use of global or local alignments of orthologous regulatory regions to assess whether a particular DNA site is conserved across related organisms, and thus more likely to be functional. Since binding sites are usually short, sometimes degenerate, and often independent of orientation, alignment algorithms may not align them correctly. Here, we present a novel, alignment-free approach for using conservation information for TF binding site discovery. We relax the definition of conserved sites: we consider a DNA site within a regulatory region to be conserved in an orthologous sequence if it occurs anywhere in that sequence, irrespective of orientation. We use this definition to derive informative priors over DNA sequence positions, and incorporate these priors into a Gibbs sampling algorithm for motif discovery. Our approach is simple and fast. It requires neither sequence alignments nor the phylogenetic relationships between the orthologous sequences, yet it is more effective on real biological data than methods that do
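The relaxed definition of conservation in the abstract above can be illustrated with a minimal sketch. It assumes that a site counts as conserved in an orthologous sequence when the site or its reverse complement occurs anywhere in that sequence. The function names, the pseudocount, and the simple normalization into a positional prior are hypothetical choices for illustration; the abstract does not specify how the published priors are derived.

```python
from typing import List


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of an uppercase DNA string."""
    complement = {"A": "T", "C": "G", "G": "C", "T": "A"}
    return "".join(complement[base] for base in reversed(seq))


def conservation_scores(region: str, orthologs: List[str], width: int) -> List[float]:
    """For each width-mer in `region`, return the fraction of orthologous
    sequences that contain it anywhere, in either orientation."""
    scores = []
    for start in range(len(region) - width + 1):
        site = region[start:start + width]
        rc_site = reverse_complement(site)
        hits = sum(1 for seq in orthologs if site in seq or rc_site in seq)
        scores.append(hits / len(orthologs) if orthologs else 0.0)
    return scores


def position_prior(scores: List[float], pseudocount: float = 0.1) -> List[float]:
    """Assumed normalization: turn scores into a distribution over start
    positions. The pseudocount keeps unconserved positions possible."""
    weights = [s + pseudocount for s in scores]
    total = sum(weights)
    return [w / total for w in weights]


# Toy example with made-up sequences
scores = conservation_scores("ACGTTT", ["GGACGT", "AAACCC"], width=4)
prior = position_prior(scores)
```

In a Gibbs sampler for motif discovery, a prior of this kind would weight candidate start positions when resampling a motif occurrence, in place of a uniform prior. No alignment or phylogeny is needed, since each orthologous sequence is only searched for substring matches.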

    Distinguishing direct versus indirect transcription factor–DNA interactions

    Transcriptional regulation is largely enacted by transcription factors (TFs) binding DNA. Large numbers of TF binding motifs have been revealed by ChIP-chip experiments followed by computational DNA motif discovery. However, the success of motif discovery algorithms has been limited when applied to sequences bound in vivo (such as those identified by ChIP-chip) because the observed TF–DNA interactions are not necessarily direct: Some TFs predominantly associate with DNA indirectly through protein partners, while others exhibit both direct and indirect binding. Here, we present the first method for distinguishing between direct and indirect TF–DNA interactions, integrating in vivo TF binding data, in vivo nucleosome occupancy data, and motifs from in vitro protein binding microarray experiments. When applied to yeast ChIP-chip data, our method reveals that only 48% of the data sets can be readily explained by direct binding of the profiled TF, while 16% can be explained by indirect DNA binding. In the remaining 36%, none of the motifs used in our analysis was able to explain the ChIP-chip data, either because the data were too noisy or because the set of motifs was incomplete. As more in vitro TF DNA binding motifs become available, our method could be used to build a complete catalog of direct and indirect TF–DNA interactions. Our method is not restricted to yeast or to ChIP-chip data, but can be applied in any system for which both in vivo binding data and in vitro DNA binding motifs are available. National Science Foundation (U.S.). (CAREER Award 0347801

    Principled computational methods for the validation and discovery of genetic regulatory networks

    Thesis (Ph. D.)--Massachusetts Institute of Technology, Dept. of Electrical Engineering and Computer Science, 2001. Includes bibliographical references (p. 193-206). As molecular biology continues to evolve in the direction of high-throughput collection of data, it has become increasingly necessary to develop computational methods for analyzing observed data that are at once both sophisticated enough to capture essential features of biological phenomena and at the same time approachable in terms of their application. We demonstrate how graphical models, and Bayesian networks in particular, can be used to model genetic regulatory networks. These methods are well-suited to this problem owing to their ability to model more than pair-wise relationships between variables, their ability to guard against over-fitting, and their robustness in the face of noisy data. Moreover, Bayesian network models can be scored in a principled manner in the presence of both genomic expression and location data. We develop methods for extending Bayesian network semantics to include edge annotations that allow us to model statistical dependencies between biological factors with greater refinement. We derive principled methods for scoring these annotated Bayesian networks. Using these models in the presence of genomic expression data requires suitable methods for the normalization and discretization of this data. We present novel methods appropriate to this context for performing each of these operations. With these elements in place, we are able to apply our scoring framework to both validate models of regulatory networks in comparison with one another and discover networks using heuristic search methods. To demonstrate the utility of this framework for the elucidation of genetic regulatory networks, we apply these methods in the context of the well-understood galactose regulatory system and the less well-understood pheromone response system in yeast. We demonstrate how genomic expression and location data can be combined in a principled manner to enable the induction of models not readily discovered if the data sources are considered in isolation. By Alexander John Hartemink. Ph.D

    SLICER: inferring branched, nonlinear cellular trajectories from single cell RNA-seq data

    Accuracy of trajectory reconstruction using a subset of cells. (a) Graph showing how similar the SLICER trajectory is when computed using a random subset of lung cells. The blue bars show the similarity in cell ordering (units are percent sorted with respect to the trajectory constructed from all cells). The orange bars show the similarity in branch assignments (percentage of cells assigned to the same branch as the trajectory constructed from all cells). The values shown were obtained by averaging the results from five subsampled datasets for each percentage (80 %, 60 %, 40 %, and 20 %). (b) Order preservation and branch identity values computed as in panel (a), but for datasets sampled from the neural stem cell dataset. (PDF 106 kb
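The branch-assignment similarity described in the caption above (the percentage of cells assigned to the same branch as in the trajectory built from all cells) can be sketched as follows. This is an illustration under stated assumptions, not SLICER's own code: the function name and the dictionary layout are hypothetical, and branch labels are assumed to be already matched between the full and subsampled runs.

```python
from typing import Dict


def branch_agreement(full: Dict[str, int], subsample: Dict[str, int]) -> float:
    """Percentage of subsampled cells whose branch label equals the label
    they received in the trajectory built from all cells."""
    shared = [cell for cell in subsample if cell in full]
    if not shared:
        raise ValueError("no cells in common between the two runs")
    same = sum(1 for cell in shared if subsample[cell] == full[cell])
    return 100.0 * same / len(shared)


# Toy example with made-up cell identifiers and branch labels
full_run = {"c1": 1, "c2": 1, "c3": 2, "c4": 2, "c5": 3}
subsampled_run = {"c1": 1, "c3": 2, "c4": 1, "c5": 3}
agreement = branch_agreement(full_run, subsampled_run)  # 75.0
```

Averaging this value over several random subsamples at a given sampling percentage would give one bar of the kind the caption describes.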

    A Nucleosome-Guided Map of Transcription Factor Binding Sites in Yeast

    Finding functional DNA binding sites of transcription factors (TFs) throughout the genome is a crucial step in understanding transcriptional regulation. Unfortunately, these binding sites are typically short and degenerate, posing a significant statistical challenge: many more matches to known TF motifs occur in the genome than are actually functional. However, information about chromatin structure may help to identify the functional sites. In particular, it has been shown that active regulatory regions are usually depleted of nucleosomes, thereby enabling TFs to bind DNA in those regions. Here, we describe a novel motif discovery algorithm that employs an informative prior over DNA sequence positions based on a discriminative view of nucleosome occupancy. When a Gibbs sampling algorithm is applied to yeast sequence-sets identified by ChIP-chip, the correct motif is found in 52% more cases with our informative prior than with the commonly used uniform prior. This is the first demonstration that nucleosome occupancy information can be used to improve motif discovery. The improvement is dramatic, even though we are using only a statistical model to predict nucleosome occupancy; we expect our results to improve further as high-resolution genome-wide experimental nucleosome occupancy data becomes increasingly available

    Core and region-enriched networks of behaviorally regulated genes and the singing genome

    Songbirds represent an important model organism for elucidating molecular mechanisms that link genes with complex behaviors, in part because they have discrete vocal learning circuits that have parallels with those that mediate human speech. We found that ~10% of the genes in the avian genome were regulated by singing, and we found a striking regional diversity of both basal and singing-induced programs in the four key song nuclei of the zebra finch, a vocal learning songbird. The region-enriched patterns were a result of distinct combinations of region-enriched transcription factors (TFs), their binding motifs, and presinging acetylation of histone 3 at lysine 27 (H3K27ac) enhancer activity in the regulatory regions of the associated genes. RNA interference manipulations validated the role of the calcium-response transcription factor (CaRF) in regulating genes preferentially expressed in specific song nuclei in response to singing. Thus, differential combinatorial binding of a small group of activity-regulated TFs and predefined epigenetic enhancer activity influences the anatomical diversity of behaviorally regulated gene networks

    Computational Inference of Neural Information Flow Networks

    Determining how information flows along anatomical brain pathways is a fundamental requirement for understanding how animals perceive their environments, learn, and behave. Attempts to reveal such neural information flow have been made using linear computational methods, but neural interactions are known to be nonlinear. Here, we demonstrate that a dynamic Bayesian network (DBN) inference algorithm we originally developed to infer nonlinear transcriptional regulatory networks from gene expression data collected with microarrays is also successful at inferring nonlinear neural information flow networks from electrophysiology data collected with microelectrode arrays. The inferred networks we recover from the songbird auditory pathway are correctly restricted to a subset of known anatomical paths, are consistent with timing of the system, and reveal both the importance of reciprocal feedback in auditory processing and greater information flow to higher-order auditory areas when birds hear natural as opposed to synthetic sounds. A linear method applied to the same data incorrectly produces networks with information flow to non-neural tissue and over paths known not to exist. To our knowledge, this study represents the first biologically validated demonstration of an algorithm to successfully infer neural information flow networks

    Monthly variation in the probability of presence of adult Culicoides populations in nine European countries and the implications for targeted surveillance

    Background: Biting midges of the genus Culicoides (Diptera: Ceratopogonidae) are small hematophagous insects responsible for the transmission of bluetongue virus, Schmallenberg virus and African horse sickness virus to wild and domestic ruminants and equids. Outbreaks of these viruses have caused economic damage within the European Union. The spatio-temporal distribution of biting midges is a key factor in identifying areas with the potential for disease spread. The aim of this study was to identify and map areas of negligible adult activity for each month in an average year. Average monthly risk maps can be used as a tool when allocating resources for surveillance and control programs within Europe. Methods: We modelled the occurrence of C. imicola and the Obsoletus and Pulicaris ensembles using existing entomological surveillance data from Spain, France, Germany, Switzerland, Austria, Denmark, Sweden, Norway and Poland. The monthly probability of each vector species and ensemble being present in Europe based on climatic and environmental input variables was estimated with the machine learning technique Random Forest. Subsequently, the monthly probability was classified into three classes: Absence, Presence and Uncertain status. These three classes are useful for mapping areas of no risk, areas of high-risk targeted for animal movement restrictions, and areas with an uncertain status that need active entomological surveillance to determine whether or not vectors are present. Results: The distribution of Culicoides species ensembles was in agreement with their previously reported distribution in Europe. The Random Forest models were very accurate in predicting the probability of presence for C. imicola (mean AUC = 0.95), less accurate for the Obsoletus ensemble (mean AUC = 0.84), while the lowest accuracy was found for the Pulicaris ensemble (mean AUC = 0.71). The most important environmental variables in the models were related to temperature and precipitation for all three groups. Conclusions: The duration of periods with low or null adult activity can be derived from the associated monthly distribution maps, and it was also possible to identify and map areas with uncertain predictions. In the absence of ongoing vector surveillance, these maps can be used by veterinary authorities to classify areas as likely vector-free or as likely risk areas from southern Spain to northern Sweden with acceptable precision. The maps can also focus costly entomological surveillance to seasons and areas where the predictions and vector-free status remain uncertain
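The three-class step in the abstract above (Absence, Presence and Uncertain status) can be sketched as a pair of cut-offs applied to a predicted probability of presence. The abstract does not state the thresholds used in the study, so the values 0.2 and 0.8 below are purely hypothetical placeholders, as are the function names.

```python
from typing import Dict


def classify_presence(probability: float, lower: float = 0.2, upper: float = 0.8) -> str:
    """Map a predicted probability of vector presence to one of three classes.
    `lower` and `upper` are assumed thresholds, not the study's values."""
    if not 0.0 <= probability <= 1.0:
        raise ValueError("probability must lie in [0, 1]")
    if probability < lower:
        return "Absence"
    if probability > upper:
        return "Presence"
    return "Uncertain"


def classify_months(monthly_probability: Dict[str, float]) -> Dict[str, str]:
    """Classify each month's predicted probability for a single location."""
    return {month: classify_presence(p) for month, p in monthly_probability.items()}


# Toy example with made-up probabilities for one location
classes = classify_months({"January": 0.05, "April": 0.50, "July": 0.95})
```

Months classified as Uncertain are the ones the abstract suggests targeting with active entomological surveillance.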